{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "## Открытый курс по машинному обучению. Сессия № 2\n",
    "\n",
    "### <center> Автор материала: Александр Уваров (@uvarov)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <center> Индивидуальный проект по анализу данных </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Задача спрогнозировать размер следующего платежа пользователя в приложении. Попытка решения задачи регрессии не дала достаточно хороших результатов, поэтому сведем ее к задаче бинарной классификации.\n",
    "\n",
    "**План исследования**\n",
    " - Описание набора данных и признаков\n",
    " - Первичный анализ признаков\n",
    " - Первичный визуальный анализ признаков\n",
    " - Закономерности, \"инсайты\", особенности данных\n",
    " - Предобработка данных\n",
    " - Создание новых признаков и описание этого процесса\n",
    " - Кросс-валидация, подбор параметров\n",
    " - Построение кривых валидации и обучения \n",
    " - Прогноз для тестовой или отложенной выборки\n",
    " - Оценка модели с описанием выбранной метрики\n",
    " - Выводы"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Чтение и описание данных"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.141794Z",
     "start_time": "2017-11-12T00:23:29.718154Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.model_selection import learning_curve\n",
    "from sklearn.model_selection import validation_curve\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "%matplotlib inline\n",
    "sns.set_context(\"notebook\")\n",
    "plt.style.use('ggplot')\n",
    "plt.rcParams['figure.figsize']=(10,5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Набор данных](https://drive.google.com/open?id=1ahOnFNDL6Yv2SnsMvL8SGI6KtSV6lAF7) представляет собой логи платежей пользователей Моего Мира в приложениях."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.381391Z",
     "start_time": "2017-11-12T00:23:31.144179Z"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('../../data/billing_data.csv', index_col=0)\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.491566Z",
     "start_time": "2017-11-12T00:23:31.383377Z"
    }
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Описание признаков**\n",
    "\n",
    "+ **amount** - размер платежа в мейликах (около 1 рубля). На ее основе будет получена целевая переменная.\n",
    "+ **app_id** - идентификатор приложения\n",
    "+ **service_id** - идентификатор типа оплаты\n",
    "+ **bill_time** - дата оплаты\n",
    "+ **city_id** - город пользователя (идентификатор)\n",
    "+ **reg_time** - дата регистрации профиля\n",
    "+ **sex** - пол (0 - женщины, 1 - мужчины)\n",
    "+ **music_count** - количество аудиозаписей в профиле пользователя\n",
    "+ **video_count** - количество видеозаписей в профиле пользователя\n",
    "+ **region_id** - идентификатор географического региона\n",
    "+ **photo_count** - количество фотографий в профиле пользователя\n",
    "+ **apps** - список приложений пользователя\n",
    "+ **country_id** - идентификатор страны\n",
    "+ **friends_count** - количество друзей у пользователя\n",
    "+ **birthday_time** - день рождения пользователя\n",
    "\n",
    "Неиспользуемые:\n",
    "+ **индекс** - идентификатор платежа\n",
    "+ **user_id** - идентификатор пользователя\n",
    "+ **service_name** - название типа оплаты\n",
    "+ **country_name** - страна пользователя\n",
    "+ **city_name** - город пользователя\n",
    "+ **region_name** - название географического региона\n",
    "\n",
    "Особенность данных в том, что мы имеем много категориальных признаков и относительно мало численных."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Преобразование данных\n",
    "Избавимся от некоторых признаков и записей:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.586668Z",
     "start_time": "2017-11-12T00:23:31.494056Z"
    }
   },
   "outputs": [],
   "source": [
    "df.drop('city_name', axis=1, inplace=True)\n",
    "df.drop('region_name', axis=1, inplace=True)\n",
    "df.drop('country_name', axis=1, inplace=True)\n",
    "\n",
    "df.dropna(subset=['birthday_time', 'reg_time'], inplace=True)\n",
    "df = df[(df['birthday_time'] != '') & (df['birthday_time'] != '01-01-1000 00:00')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Будем прогнозировать оплату в конкретном приложении, но без учета сервиса. Поэтому уберем соответсвтующие признаки:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.608516Z",
     "start_time": "2017-11-12T00:23:31.589294Z"
    }
   },
   "outputs": [],
   "source": [
    "df.drop('service_name', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Заполним некоторые пропуски:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:31.642726Z",
     "start_time": "2017-11-12T00:23:31.611261Z"
    }
   },
   "outputs": [],
   "source": [
    "df['city_id'].fillna('0', inplace=True)\n",
    "\n",
    "df['photo_count'] = df['photo_count'].map(lambda x : 0 if x == '' else x)\n",
    "df['photo_count'].fillna(0, inplace=True)\n",
    "df['photo_count'] = df['photo_count'].astype('int')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Преобразуем типы столбцов:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.190040Z",
     "start_time": "2017-11-12T00:23:31.645185Z"
    }
   },
   "outputs": [],
   "source": [
    "df['app_id'] = df['app_id'].astype('object')\n",
    "df['city_id'] = df['city_id'].astype('object')\n",
    "df['region_id'] = df['region_id'].astype('object')\n",
    "df['country_id'] = df['country_id'].astype('object')\n",
    "\n",
    "time_columns = ['bill_time', 'reg_time', 'birthday_time']\n",
    "for col in time_columns:\n",
    "    df[col] = pd.to_datetime(df[col])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Кроме того, в данных есть пользователи с датой регистрации из будущего, удалим их:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.205864Z",
     "start_time": "2017-11-12T00:23:39.191776Z"
    }
   },
   "outputs": [],
   "source": [
    "df = df[(pd.Timestamp.now() - df['reg_time']) / np.timedelta64(1, 'D') > 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Анализ данных"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-02T09:23:11.640844Z",
     "start_time": "2017-11-02T09:23:11.603012Z"
    }
   },
   "source": [
    "Посмотрим **распределение платежей по юзерам**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.261184Z",
     "start_time": "2017-11-12T00:23:39.207503Z"
    }
   },
   "outputs": [],
   "source": [
    "bills_grouping_by_users = df.groupby('user_id')['amount'].sum()\n",
    "bgu_desc = bills_grouping_by_users.describe()\n",
    "print('В среднем за каждый юзер заплатил {} мейликов'.format(np.round(bgu_desc['mean'], 2)))\n",
    "print('Самый платящий юзер оставил {} мейликов, что составило {}% от общей суммы платежей'.format(\n",
    "    int(bgu_desc['max']), np.round(100 * bgu_desc['max'] / bills_grouping_by_users.sum(), 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Накопленная по юзерам сумма платежей:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.677740Z",
     "start_time": "2017-11-12T00:23:39.262906Z"
    }
   },
   "outputs": [],
   "source": [
    "bills_users_cumsum = bills_grouping_by_users.sort_values(ascending=False).cumsum()\n",
    "plt.plot(range(bills_grouping_by_users.shape[0]), bills_users_cumsum)\n",
    "plt.title('Накопленная по юзерам сумма платежей')\n",
    "plt.xlabel('Пользователи')\n",
    "plt.ylabel('Сумма платежей');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.889424Z",
     "start_time": "2017-11-12T00:23:39.679339Z"
    }
   },
   "outputs": [],
   "source": [
    "for index, bill in enumerate(bills_users_cumsum):\n",
    "    bills_percent = 100 * bill / bills_grouping_by_users.sum()\n",
    "    if bills_percent > 80:\n",
    "        users_percent = 100 * index / bills_grouping_by_users.shape[0]\n",
    "        print('{}% прибыли нам приносят {}% пользователей'.format(\n",
    "            int(np.round(bills_percent)), np.round(users_percent, 2)))\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Распределение платежей по приложениям:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:39.900920Z",
     "start_time": "2017-11-12T00:23:39.891073Z"
    }
   },
   "outputs": [],
   "source": [
    "bills_grouping_by_apps = df.groupby('app_id')['amount'].sum()\n",
    "bga_desc = bills_grouping_by_apps.describe()\n",
    "print('В среднем каждое приложение заработало {} мейликов'.format(np.round(bga_desc['mean'], 1)))\n",
    "print('Самое прибыльное приложение заработало {} мейликов, что составило {}% от общей суммы платежей'.format(\n",
    "    int(bga_desc['max']), np.round(100 * bga_desc['max'] / bills_grouping_by_apps.sum(), 1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Накопленная по приложениям сумма платежей:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.333546Z",
     "start_time": "2017-11-12T00:23:39.902598Z"
    }
   },
   "outputs": [],
   "source": [
    "bills_apps_cumsum = bills_grouping_by_apps.sort_values(ascending=False).cumsum()\n",
    "plt.plot(range(bills_grouping_by_apps.shape[0]), bills_apps_cumsum)\n",
    "plt.title('Накопленная по приложениям сумма платежей')\n",
    "plt.xlabel('Приложения')\n",
    "plt.ylabel('Сумма платежей');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.343756Z",
     "start_time": "2017-11-12T00:23:40.335262Z"
    }
   },
   "outputs": [],
   "source": [
    "for index, bill in enumerate(bills_apps_cumsum):\n",
    "    bills_percent = 100 * bill / bills_grouping_by_apps.sum()\n",
    "    if bills_percent > 80:\n",
    "        apps_percent = 100 * index / bills_grouping_by_apps.shape[0]\n",
    "        print('Среди приложений лидеры играют еще большую роль: {}% прибыли нам приносят всего {}% приложений'.format(\n",
    "            int(np.round(bills_percent)), np.round(apps_percent, 2)))\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Посмотрим распределение платежей между женщинами (0) и мужчинами (1):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.501109Z",
     "start_time": "2017-11-12T00:23:40.345472Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.boxplot(y=df[\"sex\"], x=np.log(df[\"amount\"]), orient=\"h\")\n",
    "plt.title('Распределение платежей по гендерному признаку')\n",
    "plt.xlabel('Логарфм размера платежей')\n",
    "plt.ylabel('Пол');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-02T22:45:32.276889Z",
     "start_time": "2017-11-02T22:45:32.270785Z"
    }
   },
   "source": [
    "Как видим, мужчины в среднем платят больше."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Признак **'user_id'** больше нам не понадобится:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.552142Z",
     "start_time": "2017-11-12T00:23:40.502715Z"
    }
   },
   "outputs": [],
   "source": [
    "df.drop('user_id', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Добавление новых признаков"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Новые признаки, связанные с типом покупки:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.559093Z",
     "start_time": "2017-11-12T00:23:40.553792Z"
    }
   },
   "outputs": [],
   "source": [
    "mm_app_id = 625595 # id приложения Мой Мир\n",
    "df['mm_app'] = (df['app_id'] == mm_app_id).astype('uint8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Признак количества контента:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.587771Z",
     "start_time": "2017-11-12T00:23:40.560841Z"
    }
   },
   "outputs": [],
   "source": [
    "df['content_count'] = df['photo_count'] + df['video_count'] + df['music_count']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Разделим существующие признаки на группы по типу. **Временные** признаки выделили ранее:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:40.605797Z",
     "start_time": "2017-11-12T00:23:40.589843Z"
    }
   },
   "outputs": [],
   "source": [
    "print(time_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Обработаем временные данные:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:41.382242Z",
     "start_time": "2017-11-12T00:23:40.607611Z"
    }
   },
   "outputs": [],
   "source": [
    "time_dfs = []\n",
    "for c in time_columns:\n",
    "    # дни недели\n",
    "    weekday_series = df[c].map(lambda x : x.weekday())\n",
    "    time_dfs.append(pd.get_dummies(weekday_series, prefix=c+'_weekday'))\n",
    "    \n",
    "    # выходные\n",
    "    holiday_series = ((weekday_series == 5) | (weekday_series == 6)).astype('uint8')\n",
    "    holiday_series.name = c + '_holiday'\n",
    "    time_dfs.append(holiday_series)\n",
    "    \n",
    "    # месяц\n",
    "    month_series = df[c].map(lambda x : x.month)\n",
    "    time_dfs.append(pd.get_dummies(month_series, prefix=c+'_month'))\n",
    "    \n",
    "    # число\n",
    "    date_series = df[c].map(lambda x : x.day)\n",
    "    time_dfs.append(pd.get_dummies(date_series, prefix=c+'_date'))\n",
    "    \n",
    "# возраст\n",
    "df['age'] = np.round((pd.Timestamp.now() - df['birthday_time']) / np.timedelta64(1, 'D')).astype('int')\n",
    "\n",
    "# возраст профиля\n",
    "df['profile_age'] = np.round((pd.Timestamp.now() - df['reg_time']) / np.timedelta64(1, 'D')).astype('int')\n",
    "\n",
    "# время оплаты\n",
    "df['bill_time_hours'] = df['bill_time'].map(lambda x : x.hour)\n",
    "df['bill_time_minutes'] = df['bill_time'].map(lambda x : x.minute)\n",
    "\n",
    "time_df = pd.concat(time_dfs, axis=1)\n",
    "df.drop(time_columns, axis=1, inplace=True)\n",
    "time_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Категориальные** признаки:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:41.388575Z",
     "start_time": "2017-11-12T00:23:41.383944Z"
    }
   },
   "outputs": [],
   "source": [
    "categorical_columns = [c for c in df.columns if df[c].dtype.name == 'object']\n",
    "print(categorical_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Применим OHE для нескольких небинарных категориальных признаков. Сначала найдем эти признаки:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:41.825267Z",
     "start_time": "2017-11-12T00:23:41.390286Z"
    }
   },
   "outputs": [],
   "source": [
    "data_describe = df.describe(include=[object])\n",
    "nonbinary_columns = [c for c in categorical_columns if data_describe[c]['unique'] > 2 and not c in ['apps', 'service_name']]\n",
    "nonbinary_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Оценим, сколько признаков получится в результате OHE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:41.999630Z",
     "start_time": "2017-11-12T00:23:41.827245Z"
    }
   },
   "outputs": [],
   "source": [
    "print(df[nonbinary_columns].describe().loc['unique'])\n",
    "int(sum(df[nonbinary_columns].describe().loc['unique']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Применим OHE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:23:42.744241Z",
     "start_time": "2017-11-12T00:23:42.001515Z"
    }
   },
   "outputs": [],
   "source": [
    "ohe_dfs = []\n",
    "for c in nonbinary_columns:\n",
    "    ohe_dfs.append(pd.get_dummies(df[c], prefix=c))\n",
    "ohe_df = pd.concat(ohe_dfs, axis=1)\n",
    "df.drop(nonbinary_columns, axis=1, inplace=True)\n",
    "ohe_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Преобразуем признак **'apps'** с использованием OHE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.183033Z",
     "start_time": "2017-11-12T00:23:42.745862Z"
    }
   },
   "outputs": [],
   "source": [
    "has_apps_df = df['apps'].str.get_dummies(',').rename(columns=lambda col : 'has_app_' + col)\n",
    "df.drop('apps', axis=1, inplace=True)\n",
    "has_apps_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-10T09:51:05.362849Z",
     "start_time": "2017-11-10T09:51:05.359734Z"
    }
   },
   "source": [
    "Добавим признак количества установленных приложений:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.339060Z",
     "start_time": "2017-11-12T00:24:17.184618Z"
    }
   },
   "outputs": [],
   "source": [
    "df['apps_count'] = has_apps_df.sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Признаки \"активности\" пользователя:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.349726Z",
     "start_time": "2017-11-12T00:24:17.340789Z"
    }
   },
   "outputs": [],
   "source": [
    "df['activity'] = df['content_count'] + df['apps_count'] + df['friends_count']\n",
    "df['social_activity'] = df['apps_count'] + df['friends_count']\n",
    "df['apps_per_age'] = df['apps_count'] / df['age']\n",
    "df['apps_per_profile_age'] = df['apps_count'] / df['profile_age']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Выделим бинарные признаки:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.374118Z",
     "start_time": "2017-11-12T00:24:17.351590Z"
    }
   },
   "outputs": [],
   "source": [
    "binary_columns = ['sex', 'mm_app']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Перейдем к **численным** признакам:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.409059Z",
     "start_time": "2017-11-12T00:24:17.375425Z"
    }
   },
   "outputs": [],
   "source": [
    "numerical_columns = [c for c in df.columns if (c not in categorical_columns) and (c not in time_columns) and\n",
    "                    (c not in binary_columns)]\n",
    "print(numerical_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:17.427775Z",
     "start_time": "2017-11-12T00:24:17.410672Z"
    }
   },
   "outputs": [],
   "source": [
    "numerical_columns.remove('amount')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Посмотрим их распределение:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:24.427072Z",
     "start_time": "2017-11-12T00:24:17.429076Z"
    }
   },
   "outputs": [],
   "source": [
    "nc_count = len(numerical_columns)\n",
    "_, axes = plt.subplots(nc_count, 2, figsize=(16, nc_count * 5))\n",
    "\n",
    "for i, col in enumerate(numerical_columns):\n",
    "    sns.distplot(df[col], ax=axes[i, 0]);\n",
    "    sns.distplot(np.log(df[col] + 1), ax=axes[i, 1]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Из графиков можно сделать вывод, что некоторые признаки лучше прологарифмировать:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:24.465199Z",
     "start_time": "2017-11-12T00:24:24.428772Z"
    }
   },
   "outputs": [],
   "source": [
    "log_columns = numerical_columns[:]\n",
    "log_columns.remove('age')\n",
    "log_columns.remove('profile_age')\n",
    "log_columns.remove('bill_time_hours')\n",
    "log_columns.remove('bill_time_minutes')\n",
    "\n",
    "for col in log_columns:\n",
    "    log_column_name = col + '_log'\n",
    "    df[log_column_name] = np.log(df[col] + 1)\n",
    "    df.drop(col, axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-05T23:49:42.195895Z",
     "start_time": "2017-11-05T23:49:42.189601Z"
    }
   },
   "source": [
    "Теперь проведем нормализацию численных признаков:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:24.485705Z",
     "start_time": "2017-11-12T00:24:24.466860Z"
    }
   },
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "for col in numerical_columns:\n",
    "    if col not in df.columns:\n",
    "        col += '_log'\n",
    "    scaled_col = scaler.fit_transform(df[col].values.reshape(-1, 1))\n",
    "    df[col] = scaled_col.flatten()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Соберем признаки вместе:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:25.626173Z",
     "start_time": "2017-11-12T00:24:24.487328Z"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.concat([df, time_df, ohe_df, has_apps_df], axis=1)\n",
    "del has_apps_df, ohe_df, time_df\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Выделим целевую переменную:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:25.904088Z",
     "start_time": "2017-11-12T00:24:25.627637Z"
    }
   },
   "outputs": [],
   "source": [
    "y = df['amount']\n",
    "df.drop('amount', axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Сведем задачу к задаче бинарной классификации. Разделим платежи медианой на 2 группы (маленькие платежи и большие):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:24:25.910401Z",
     "start_time": "2017-11-12T00:24:25.905781Z"
    }
   },
   "outputs": [],
   "source": [
    "y = (y > y.median()).astype('uint8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Кросс-валидация, подбор параметров"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Поскольку целевой признак идеально сбалансирован, будем измерять метрику accuracy, которая используется по умолчанию в классификаторах. Сравним бейслайны нескольких алгоритмов классификации:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:25:36.929924Z",
     "start_time": "2017-11-12T00:24:25.912003Z"
    }
   },
   "outputs": [],
   "source": [
    "algs = [\n",
    "    ['Решающее дерево', DecisionTreeClassifier(random_state=0)],\n",
    "    ['Случайный лес', RandomForestClassifier(random_state=0)],\n",
    "    ['Логистическая регрессия', LogisticRegression(random_state=0)],\n",
    "]\n",
    "for alg_name, classifier in algs:\n",
    "    score = np.round(np.mean(cross_val_score(classifier, df, y, cv=3)), 4)\n",
    "    print('{}: {}'.format(alg_name, score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SVC и GradientBoostingClassifier не удалось проверить за приемлемое время. Лучший результат показала логистическая регрессия, ее и возьмем. Подберем гиперпараметры:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:37:32.767797Z",
     "start_time": "2017-11-12T00:25:36.931847Z"
    }
   },
   "outputs": [],
   "source": [
    "C_range = np.logspace(0, 1, 30)\n",
    "params = {\n",
    "    'penalty': ['l1', 'l2'], \n",
    "    'C': C_range\n",
    "}\n",
    "grid = GridSearchCV(LogisticRegression(random_state=0), params)\n",
    "grid.fit(df, y)\n",
    "print(grid.best_score_, grid.best_params_)\n",
    "best_penalty_gs = grid.best_params_['penalty']\n",
    "best_C_gs = grid.best_params_['C']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Построение кривых валидации и обучения"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:37:32.775358Z",
     "start_time": "2017-11-12T00:37:32.769767Z"
    }
   },
   "outputs": [],
   "source": [
    "def plot_with_err(x, data, **kwargs):\n",
    "    mu, std = data.mean(1), data.std(1)\n",
    "    lines = plt.plot(x, mu, '-', **kwargs)\n",
    "    plt.fill_between(x, mu - std, mu + std, edgecolor='none',\n",
    "    facecolor = lines[0].get_color(), alpha=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:44:24.235702Z",
     "start_time": "2017-11-12T00:38:26.541125Z"
    }
   },
   "outputs": [],
   "source": [
    "lr = LogisticRegression(penalty=best_penalty_gs, random_state=0)\n",
    "val_train, val_test = validation_curve(lr, df, y, param_name='C', param_range=C_range)\n",
    "plot_with_err(C_range, val_train, label='training scores')\n",
    "plot_with_err(C_range, val_test, label='validation scores')\n",
    "plt.xlabel('C')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.legend()\n",
    " \n",
    "best_C = C_range[val_test.mean(axis=1).argmax()]\n",
    "print('Лучшее значение параметра C: {}, что совпадает с результатом GridSearch'.format(np.round(best_C, 3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "На кривой валидации видно изменение метрики в зависимости от параметра С и его оптимальное значение, при котором метрика на валидации максимальна."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:47:11.469121Z",
     "start_time": "2017-11-12T00:46:36.994152Z"
    }
   },
   "outputs": [],
   "source": [
    "lr = LogisticRegression(penalty=best_penalty_gs, C=best_C, random_state=0)\n",
    "N_train, learn_train, learn_test = learning_curve(lr, df, y, random_state=0)\n",
    "plot_with_err(N_train, learn_train, label='training scores')\n",
    "plot_with_err(N_train, learn_test, label='validation scores')\n",
    "plt.xlabel('Размер датасета')\n",
    "plt.ylabel('Метрика')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-11T00:22:05.143306Z",
     "start_time": "2017-11-11T00:19:00.304Z"
    },
    "collapsed": true
   },
   "source": [
    "По кривой обучения видно, что улучшить модель может увеличение количества данных, т.к. результаты на трейне и валидации продолжают сходиться."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Прогноз для тестовой или отложенной выборки"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:44:58.768358Z",
     "start_time": "2017-11-12T00:44:58.016680Z"
    }
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-12T00:47:14.271289Z",
     "start_time": "2017-11-12T00:47:11.471784Z"
    }
   },
   "outputs": [],
   "source": [
    "lr = LogisticRegression(penalty=best_penalty_gs, C=best_C, random_state=0)\n",
    "lr.fit(X_train, y_train)\n",
    "predict = lr.predict(X_test)\n",
    "print('На отложенной выборке accuracy = {}'.format(np.round(accuracy_score(y_test, predict), 4)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Выводы"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Построена модель классификатора платежей в приложениях в социальной сети Мой Мир, предсказывающего их размер (большая сумма или маленькая). На отложенной выборке получили долю правильных предсказаний 85%, что является достаточно хорошим показателем.\n",
    "\n",
    "С помощьу полученной модели можно предсказывать прибыль в зависимости от различных входных данных и создавать выгодные акции и предложения для пользователей, увеличивающие доход."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
